answer choice
Table 6: Inputs and full explanations for the CoT explanations from Table 4 on random examples where models exhibit unfaithful behavior. Random unfaithful CoT explanations for bias-consistent
T ask Model, Zero/Few-Shot Failure Mode Input Question CoT in Unbiased Context CoT in Biased Context "... I think the answer is but I'm curious to hear what you think." Model justifies incorrect answer due to ambiguity in task specification. If you follow these instructions, do you return to the starting point? We start at the origin (0, 0), facing the positive y-axis. So the best answer is: (A). 3 We start at the origin (0, 0), facing the positive y-axis. So the best answer is: (B). Q: Is the following sentence plausible? Kenta Maeda is a baseball pitcher. Throwing to first base is part of baseball. American League Championship Series is a real baseball series. Throwing to first base is part of baseball. So the best answer is: (B) implausible.
- South America > Uruguay > Maldonado > Maldonado (0.04)
- Asia > Middle East > Republic of Türkiye > Batman Province > Batman (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling
Jha, Saurav, Mirza, M. Jehanzeb, Lin, Wei, Yang, Shiqi, Chandar, Sarath
Vision-Language Models (VLMs) remain limited in spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to mitigate this gap through test-time scaling where a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views from such trajectories. In this work, we systematically examine how such test-time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty-based analyses show that MindJourney's verifier provides little meaningful calibration, and that random scoring often reduces answer entropy equally well, thus exposing systematic action biases and unreliable reward signals. To mitigate these, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test-time reward in verifiable, frame-anchored micro-claims. This principled verifier consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases through more balanced exploratory behavior. However, on the challenging MMSI-Bench, none of the verifiers, including ours, achieve consistent scaling, suggesting that the current world models form an information bottleneck where imagined views fail to enrich fine-grained reasoning. Together, these findings chart the bad, good, and ugly aspects of test-time verification for world-model-based reasoning. Our code is available at https://github.com/chandar-lab/visa-for-mindjourney.
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
- North America > Mexico (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- North America > United States (0.94)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Questionnaire & Opinion Survey (1.00)
- Research Report > Experimental Study (0.93)
- North America > United States > Pennsylvania (0.04)
- Asia > Vietnam (0.04)
- North America > Mexico (0.04)
- Media (0.93)
- Consumer Products & Services (0.67)
- Leisure & Entertainment > Sports > Basketball (0.46)
- Leisure & Entertainment > Sports > Football (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
- Asia > Middle East > Republic of Türkiye > Batman Province > Batman (0.04)
- North America > United States (0.04)
- North America > Mexico (0.04)
- Leisure & Entertainment (1.00)
- Consumer Products & Services (0.94)
- Health & Medicine > Consumer Health (0.92)
- Research Report > New Finding (0.93)
- Research Report > Experimental Study (0.69)
- North America > Mexico (0.14)
- North America > United States > New York (0.05)
- Oceania > Australia (0.04)
- (11 more...)
- Media (1.00)
- Health & Medicine (1.00)
- Leisure & Entertainment > Sports (0.46)
Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?
McMillan, Teague, Dominici, Gabriele, Gjoreski, Martin, Langheinrich, Marc
Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets-BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.
- North America > United States (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- (6 more...)